From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis
Bajbaa, Khawlah, Anwar, Abbas, Saqib, Muhammad, Anwar, Hafeez, Sharma, Nabin, Usman, Muhammad
Street view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of valuable insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery presents significant challenges due to substantial differences in appearance and viewing perspective between these two domains. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework's capabilities, we integrate a conditional Generative Adversarial Network (GAN) that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods. The framework successfully generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.
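The abstract stops short of implementation detail; purely as an illustration of the dual-branch fusion idea it describes, here is a minimal PyTorch sketch in which a learned gate blends feature maps from a hypothetical diffusion branch and GAN branch. The module name, gate design, and tensor shapes are all assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a dual-branch feature fusion module (not the authors' code).
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Fuses feature maps from a diffusion branch and a GAN branch
    with a learned, spatially varying gate."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a per-pixel mixing weight from both branches.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, diff_feat: torch.Tensor, gan_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([diff_feat, gan_feat], dim=1))
        # Convex combination: g -> 1 favours the diffusion branch.
        return g * diff_feat + (1 - g) * gan_feat

fusion = DualBranchFusion(channels=64)
out = fusion(torch.randn(1, 64, 32, 128), torch.randn(1, 64, 32, 128))
print(out.shape)  # torch.Size([1, 64, 32, 128])
```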
- Asia > Middle East > Saudi Arabia > Eastern Province > Dhahran (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > Michigan > Wayne County > Detroit (0.04)
- (5 more...)
- Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.55)
- Transportation > Ground > Road (0.48)
AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models
Xu, Shixiong, Zhang, Chenghao, Fan, Lubin, Zhou, Yuan, Fan, Bin, Xiang, Shiming, Meng, Gaofeng, Ye, Jieping
Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that street-view visual question answering (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning, including a satellite-view and street-view image grafting mechanism along with an automatic label generation mechanism. The LVLM's global understanding of street distribution is then enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.
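As a rough idea of what satellite/street-view "grafting" could look like in practice, the following sketch simply stacks a satellite tile above the corresponding street view to form one composite input image. The paper's actual grafting mechanism is not specified here, so the layout and function name are assumptions.

```python
# Hypothetical sketch of satellite/street-view image "grafting": this simply
# stacks the two views; the paper's exact composition rules may differ.
from PIL import Image

def graft_views(satellite: Image.Image, street: Image.Image, width: int = 1024) -> Image.Image:
    # Resize both views to a common width, preserving aspect ratio.
    sat = satellite.resize((width, int(satellite.height * width / satellite.width)))
    st = street.resize((width, int(street.height * width / street.width)))
    # Stack the satellite tile (macro cue) above the street view (micro cue).
    canvas = Image.new("RGB", (width, sat.height + st.height))
    canvas.paste(sat, (0, 0))
    canvas.paste(st, (0, sat.height))
    return canvas

# Usage: graft_views(Image.open("sat.png"), Image.open("street.png")).save("grafted.png")
```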
- North America > United States > California > San Francisco County > San Francisco (0.24)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- Asia > China > Beijing > Beijing (0.04)
- Europe > Spain > Aragón (0.04)
Modeling Urban Food Insecurity with Google Street View Images
Food insecurity is a significant social and public health issue that plagues many urban metropolitan areas around the world. Existing approaches to identifying food insecurity rely primarily on qualitative and quantitative survey data, which is difficult to scale. This project seeks to explore the effectiveness of using street-level images in modeling food insecurity at the census tract level. To do so, we propose a two-step process of feature extraction and gated attention for image aggregation. We evaluate the effectiveness of our model by comparing against other model architectures, interpreting our learned weights, and performing a case study. While our model falls slightly short in terms of its predictive power, we believe our approach still has the potential to supplement existing methods of identifying food insecurity for urban planners and policymakers.
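Gated attention for aggregating a variable number of street-view images into one census-tract representation can be sketched as below, in the spirit of gated attention-based multiple-instance learning; the layer sizes and names are illustrative, not the project's code.

```python
# Minimal sketch of gated-attention pooling over per-image features.
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Aggregates a variable-size bag of image embeddings into one
    tract-level vector using gated attention weights."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)   # content branch (tanh)
        self.U = nn.Linear(dim, hidden)   # gate branch (sigmoid)
        self.w = nn.Linear(hidden, 1)     # attention score

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_images, dim) for one census tract.
        scores = self.w(torch.tanh(self.V(feats)) * torch.sigmoid(self.U(feats)))
        alpha = torch.softmax(scores, dim=0)   # (num_images, 1)
        return (alpha * feats).sum(dim=0)      # (dim,)

pool = GatedAttentionPool(dim=512)
tract_vec = pool(torch.randn(37, 512))  # 37 street-view images in one tract
print(tract_vec.shape)  # torch.Size([512])
```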
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom (0.04)
- Africa > Kenya > Nairobi City County > Nairobi (0.04)
Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge
Dionelis, Nikolaos, Longépé, Nicolas, Feliciotti, Alessandra, Marconcini, Mattia, Peressutti, Devis, Kadunc, Nika Oman, Park, JaeWan, Sinulingga, Hagai Raja, Immanuel, Steve Andreas, Tran, Ba, Arnold, Caroline
Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible and sustainable urban planning and development to effectively combat climate change. By using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, i.e. the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with the construction epoch. We assess EO generalization performance on new, previously unseen cities that have been held out from training and appear only during inference. In this work, we present the community-based data challenge we organized based on MyCD. The ESA AI4EO Challenge MapYourCity was open in 2024 for 4 months. Here, we present the Top-4 performing models and the main evaluation results. During inference, we examine the performance of the models using both all three input modalities and only the two top-view modalities, i.e. without the street-view images. The evaluation results show that the models are effective and can achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities, and even using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.
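One plausible way to realize the "works even without street view at inference" behaviour is late fusion over whichever modality embeddings are present. The sketch below is an assumption about such a design, not one of the challenge's Top-4 models.

```python
# Illustrative late-fusion classifier over precomputed modality embeddings
# (VHR top-view, Sentinel-2, street view); all details are assumptions.
import torch
import torch.nn as nn

class EpochClassifier(nn.Module):
    def __init__(self, dim: int = 256, num_epochs: int = 7):
        super().__init__()
        self.head = nn.Linear(dim, num_epochs)

    def forward(self, vhr, s2, street=None):
        # Mean-pool whichever modality embeddings are available, so the model
        # can also run with only the two top-view modalities at inference.
        feats = [vhr, s2] + ([street] if street is not None else [])
        fused = torch.stack(feats, dim=0).mean(dim=0)
        return self.head(fused)

model = EpochClassifier()
logits = model(torch.randn(4, 256), torch.randn(4, 256))  # street view omitted
print(logits.shape)  # torch.Size([4, 7])
```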
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Middle East > Oman (0.05)
- Europe > Germany (0.04)
- (6 more...)
- Energy (0.50)
- Education > Educational Setting > Higher Education (0.47)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Zhou, Baichuan, Yang, Haote, Chen, Dairong, Ye, Junyan, Bai, Tianyi, Yu, Jinhua, Zhang, Songyang, Lin, Dahua, He, Conghui, Li, Weijia
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs on basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale, high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects. Even the best-performing model, GPT-4o, lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization, and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors across different urban views, especially with respect to understanding cross-view relations. UrBench datasets and benchmark results will be publicly available at https://opendatalab.github.io/UrBench/.
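A benchmark of this shape is typically scored with a per-task accuracy loop. The sketch below is a generic harness, where `query_model` and the question fields are placeholders rather than UrBench's actual data format.

```python
# Hypothetical scoring loop for a multiple-choice benchmark like UrBench;
# `query_model` stands in for whatever LMM API is being evaluated.
def evaluate(questions, query_model):
    per_task = {}
    for q in questions:  # q: {"task": str, "prompt": str, "answer": str}
        pred = query_model(q["prompt"]).strip().upper()[:1]  # e.g. "A"/"B"/"C"/"D"
        hits, total = per_task.get(q["task"], (0, 0))
        per_task[q["task"]] = (hits + (pred == q["answer"]), total + 1)
    # Per-task accuracy, so gaps on e.g. counting vs. orientation are visible.
    return {task: hits / total for task, (hits, total) in per_task.items()}
```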
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
Li, Ling, Ye, Yu, Jiang, Bingchuan, Zeng, Wei
This work tackles the problem of geo-localization with a new paradigm: a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues and provide no reasoning annotations. To address the data-quality issue, we devise a CLIP-based network to quantify how locatable a street-view image is, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are used to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP's performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
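The CLIP-based locatability network in the paper is trained on curated data; as a zero-shot stand-in, one could score images against contrastive text prompts, as in the sketch below. The prompts and the probability-based scoring are assumptions, not the paper's method.

```python
# A rough zero-shot sketch of scoring how "locatable" a street-view image is
# with CLIP; the paper trains a dedicated network, and these prompts are made up.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = [
    "a street view with distinctive signs, architecture, and vegetation",
    "a featureless road with no recognizable location clues",
]

def locatability(image: Image.Image) -> float:
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()  # probability mass on the "locatable" prompt
```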
- Europe > Austria > Vienna (0.14)
- Asia > China > Tibet Autonomous Region (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- (18 more...)
- Transportation > Ground > Road (0.68)
- Transportation > Infrastructure & Services (0.46)
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
Hao, Xixuan, Chen, Wei, Yan, Yibo, Zhong, Siru, Wang, Kun, Wen, Qingsong, Liang, Yuxuan
Urban region profiling aims to learn a low-dimensional representation of a given urban area while preserving its characteristics, such as demographics, infrastructure, and economic activities, for urban planning and development. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias and miss nuanced micro-level details, such as the architectural details of a place. Secondly, the lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, elevating interpretability in downstream applications by producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six urban indicator prediction tasks underscore its superior performance.
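Aligning macro (satellite) and micro (street-view) views of the same region is commonly done with a symmetric contrastive loss. The following sketch shows that generic form; it is an assumption about the training signal, not UrbanVLP's actual objective.

```python
# Minimal sketch of contrastive alignment between macro (satellite) and
# micro (street-view) embeddings of the same region; the loss form is assumed.
import torch
import torch.nn.functional as F

def alignment_loss(sat_emb: torch.Tensor, street_emb: torch.Tensor, tau: float = 0.07):
    # Embeddings: (batch, dim); row i of each tensor describes the same region.
    sat = F.normalize(sat_emb, dim=-1)
    street = F.normalize(street_emb, dim=-1)
    logits = sat @ street.t() / tau        # pairwise similarities
    targets = torch.arange(sat.size(0))    # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```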
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (7 more...)
- Transportation > Ground > Road (0.68)
- Banking & Finance > Economy (0.48)
- Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)
Bird's-Eye View to Street-View: A Survey
Bajbaa, Khawlah, Usman, Muhammad, Anwar, Saeed, Radwan, Ibrahim, Bais, Abdul
In recent years, street view imagery has grown to become one of the most important sources of geospatial data collection and urban analytics, facilitating meaningful insights and assisting decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state of the art in synthesizing street-view images from their satellite counterparts. The main findings are: (i) novel deep learning techniques are required for synthesizing more realistic and accurate street-view images; (ii) more datasets need to be collected for public usage; and (iii) more specific evaluation metrics need to be investigated for evaluating the generated images appropriately. We conclude that, because it relies on outdated deep learning techniques, the recent literature has failed to generate detailed and diverse street-view images.
- Asia > Middle East > Saudi Arabia > Eastern Province > Dhahran (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- (5 more...)
Visualizing Routes with AI-Discovered Street-View Patterns
Wu, Tsung Heng, Amiruzzaman, Md, Zhao, Ye, Bhati, Deepshikha, Yang, Jing
Street-level visual appearances play an important role in studying social systems, such as understanding the built environment, driving routes, and associated social and economic factors. Yet this visual information has not been integrated into typical geographical visualization interfaces (e.g., map services) for planning driving routes. In this paper, we study this new visualization task with several new contributions. First, we experiment with a set of AI techniques and propose a solution that uses semantic latent vectors to quantify visual appearance features. Second, we calculate image similarities among a large set of street-view images and then discover spatial imagery patterns. Third, we integrate these discovered patterns into driving route planners with new visualization techniques. Finally, we present VivaRoutes, an interactive visualization prototype, to show how visualizations leveraging these discovered patterns can help users effectively and interactively explore multiple routes. Furthermore, we conducted a user study to assess the usefulness and utility of VivaRoutes.
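The similarity step can be illustrated with plain cosine similarity over per-image latent vectors sampled along a route; in the sketch below, the vectors are random stand-ins for the semantic latents the paper computes.

```python
# Sketch of the similarity step: compare semantic latent vectors of
# street-view images sampled along a route (random stand-in vectors).
import numpy as np

def pairwise_cosine(latents: np.ndarray) -> np.ndarray:
    normed = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    return normed @ normed.T

route_latents = np.random.rand(50, 128)   # 50 sampled points, 128-d latents
sim = pairwise_cosine(route_latents)
# A simple "pattern": route segments whose consecutive images stay highly similar.
consecutive = np.diag(sim, k=1)
print((consecutive > 0.9).sum(), "of 49 consecutive pairs look visually alike")
```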
- North America > United States > New York > New York County > Manhattan (0.14)
- North America > United States > Ohio (0.04)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
- (3 more...)
- Research Report (1.00)
- Questionnaire & Opinion Survey (1.00)
Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types
Soler, Jordi Laguarta, Friedel, Thomas, Wang, Sherrie
Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies. To date, however, crop type maps remain challenging to create in low- and middle-income countries due to a lack of ground truth labels for training machine learning models. Field surveys are the gold standard in terms of accuracy but require an often prohibitively large amount of time, money, and statistical capacity. In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world. Such imagery contains rich information about crop types grown at particular locations and times. In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery. The method efficiently curates a set of street view images containing crop fields, trains a model to predict crop type by utilizing weakly-labelled images from disparate out-of-domain sources, and combines predicted labels with remote sensing time series to create a wall-to-wall crop type map. We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%. We publicly release the first-ever crop type map for all of Thailand for 2022 at 10 m resolution with no gaps. To our knowledge, this is the first time a 10 m resolution, multi-crop map has been created for any smallholder country. As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions.
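The final mapping step, training a classifier on satellite time series with the street-view-derived labels as ground references, might look like the following sketch; the features, label source, and random-forest choice are placeholders for the paper's actual pipeline.

```python
# High-level sketch of the pipeline's final step: street-view-derived labels
# paired with satellite time series train a field-level classifier
# (data here are random placeholders, not the authors' implementation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

CROPS = ["rice", "cassava", "maize", "sugarcane"]

# Pretend ground references: points labelled by the street-view crop model.
labels = np.random.randint(0, len(CROPS), size=500)
# Pretend Sentinel-2 time series features at those points (e.g. NDVI over a year).
ts_features = np.random.rand(500, 24)

clf = RandomForestClassifier(n_estimators=200).fit(ts_features, labels)
# Apply to every pixel's time series to obtain a wall-to-wall map.
pred = clf.predict(np.random.rand(10_000, 24))
print({c: int((pred == i).sum()) for i, c in enumerate(CROPS)})
```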
- Asia > Thailand (0.47)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > India (0.04)
- (8 more...)